This episode is intended for intermediate learners of R who wish to
explore sentiment analysis. You can follow and execute the individual chunks
in this R Markdown document to analyze the emotional
loading of the IPCC (Intergovernmental Panel on Climate Change) Special
Report on Global Warming of 1.5°C. Once you understand the digital workflow, you
can analyze digital texts of your choice. Ask questions whenever chunks
do not render or produce confusing outputs.
We start with loading the IPCC text and wrangling the data, and
introduce basic text-mining concepts. Then we spend the bulk of the time
demonstrating different kinds of sentiment measurement with R tools
(tidytext). We visualize the results in order to assess the
strengths and shortcomings of these approaches for different research
tasks.
A fantastic resource on tools and concepts is Julia Silge and David Robinson’s Text Mining with R.
Another text to accompany and further explain these concepts is Nina Tahmasebi and Simon Hengchen (2019), “The Strengths and Pitfalls of Large-Scale Text Mining for Literary Studies”, Samlaren.
```r
# Load general libraries
library(tidyverse)
library(here)

# Load libraries for text mining:
library(pdftools)
library(tidytext)
library(textdata)
library(ggwordcloud)
```

```r
ipcc_path <- here("data", "ipcc_gw_15.pdf")
ipcc_text <- pdf_text(ipcc_path)
```

Notice that pdf_text() returns a character vector with one element per page of the PDF.
Example: just want to get text from a single page (e.g. page 9)?

```r
ipcc_p9 <- ipcc_text[9]
ipcc_p9
```

```
[1] " Summary for Policymakers\n\n\n\n\nWe would also like to thank Abdalah Mokssit, Secretary of the IPCC, and the staff of the\nIPCC Secretariat: Kerstin Stendahl, Jonathan Lynn, Sophie Schlingemann, Judith Ewa, Mxolisi\nShongwe, Jesbin Baidya, Werani Zabula, Nina Peeva, Joelle Fernandez, Annie Courtin, Laura\nBiagioni and Oksana Ekzarho. Thanks are due to Elhousseine Gouaini who served as the SPM\nconference officer for the 48th Session of the IPCC.\n\n\nFinally, our particular appreciation goes to the Working Group Technical Support Units\nwhose tireless dedication, professionalism and enthusiasm led the production of this\nSpecial Report. This report could not have been prepared without the commitment of\nmembers of the Working Group I Technical Support Unit, all new to the IPCC, who rose\nto the unprecedented Sixth Assessment Report challenge and were pivotal in all aspects\nof the preparation of the Report: Yang Chen, Sarah Connors, Melissa Gomis, Elisabeth\nLonnoy, Robin Matthews, Wilfran Moufouma-Okia, Clotilde Péan, Roz Pidcock, Anna Pirani,\nNicholas Reay, Tim Waterfield, and Xiao Zhou. Our warmest thanks go to the collegial and\ncollaborative support provided by Marlies Craig, Andrew Okem, Jan Petzold, Melinda Tignor\nand Nora Weyer from the WGII Technical Support Unit and Bhushan Kankal, Suvadip Neogi\nand Joana Portugal Pereira from the WGIII Technical Support Unit. A special thanks goes\nto Kenny Coventry, Harmen Gudde, Irene Lorenzoni, and Stuart Jenkins for their support\nwith the figures in the Summary for Policymakers, as well as Nigel Hawtin for graphical\nsupport of the Report. In addition, the following contributions are gratefully acknowledged:\nJatinder Padda (copy edit), Melissa Dawes (copy edit), Marilyn Anderson (index), Vincent\nGrégoire (layout) and Sarah le Rouzic (intern).\n\n\nThe Special Report website has been developed by Habitat 7, led by Jamie Herring, and\nthe report content has been prepared and managed for the website by Nicholas Reay and\nTim Waterfield. We gratefully acknowledge the UN Foundation for supporting the website\ndevelopment.\n\n\n\n\n 5\n"
```
See how that compares to the text in the PDF on page 9. What has the
pdftools library added, and where? (Hint: line breaks in the PDF are
encoded as \n.)
Split each page into lines using stringr::str_split(), give each line its own row with tidyr::unnest(), and remove leading and trailing whitespace with stringr::str_trim():

```r
ipcc_df <- data.frame(ipcc_text) %>%
  mutate(text_full = str_split(ipcc_text, pattern = '\n')) %>%
  unnest(text_full) %>%
  mutate(text_full = str_trim(text_full))

# Why would you write '\\n' instead of '\n'? In a regular expression, some
# characters (e.g. \, *) must be escaped with a backslash, and backslashes
# in R strings must themselves be escaped -- so the regex \n is written as
# the string '\\n'. Here the plain string '\n' (a literal newline
# character) also matches newlines, so both forms work.
# More information:
# https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html
```

Now each line, on each page, is its own row, with extra leading and trailing spaces removed.
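To see that both spellings of the pattern behave the same way, here is a minimal sketch (using a toy string rather than the IPCC text) comparing '\n' and '\\n' in str_split():

```r
library(stringr)

# A toy string containing a literal newline:
s <- "line one\nline two"

# '\n' is a string holding a literal newline character, which the
# regex engine matches directly:
str_split(s, pattern = "\n")[[1]]

# '\\n' is the two-character regex escape for a newline; the regex
# engine also interprets it as a newline, so both patterns split the
# string identically:
identical(str_split(s, "\n"), str_split(s, "\\n"))
```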
Use tidytext::unnest_tokens() (which pulls from the
tokenizers package) to split the column into tokens. We are
interested in words, so that’s the token we’ll use:

```r
ipcc_tokens <- ipcc_df %>%
  unnest_tokens(word, text_full)
```
```r
ipcc_tokens
```

```
# A tibble: 15,151 × 2
   ipcc_text                                                   word
   <chr>                                                       <chr>
 1 " Global warming of 1.5°C\n  An IPCC Special Report on the… glob…
 2 " Global warming of 1.5°C\n  An IPCC Special Report on the… warm…
 3 " Global warming of 1.5°C\n  An IPCC Special Report on the… of
 4 " Global warming of 1.5°C\n  An IPCC Special Report on the… 1.5
 5 " Global warming of 1.5°C\n  An IPCC Special Report on the… c
 6 " Global warming of 1.5°C\n  An IPCC Special Report on the… an
 7 " Global warming of 1.5°C\n  An IPCC Special Report on the… ipcc
 8 " Global warming of 1.5°C\n  An IPCC Special Report on the… spec…
 9 " Global warming of 1.5°C\n  An IPCC Special Report on the… repo…
10 " Global warming of 1.5°C\n  An IPCC Special Report on the… on
# … with 15,141 more rows
```
See how this differs from ipcc_df: each word has its own row! Let’s count the words:
```r
ipcc_wc <- ipcc_tokens %>%
  count(word) %>%
  arrange(-n)

ipcc_wc
```

```
# A tibble: 2,413 × 2
   word           n
   <chr>      <int>
 1 and          616
 2 the          505
 3 of           476
 4 to           407
 5 in           352
 6 c            283
 7 global       223
 8 confidence   213
 9 warming      188
10 for          174
# … with 2,403 more rows
```
OK… so we notice that a whole bunch of words show up frequently that we might not be interested in (“a”, “the”, “and”, etc.). These are called stop words. Let’s remove them.
See ?stop_words and View(stop_words) to look
at the documentation for the stop-word lexicons.
We will remove stop words using
dplyr::anti_join():
```r
ipcc_stop <- ipcc_tokens %>%
  anti_join(stop_words) %>%
  select(-ipcc_text)
```

Now check the counts again:

```r
ipcc_swc <- ipcc_stop %>%
  count(word) %>%
  arrange(-n)
```

What if we want to get rid of all the numbers (non-text) in ipcc_stop?
```r
# Filter out numbers: as.numeric() converts numeric strings to numbers
# and everything else to NA (with a warning). Keeping only the rows
# where the conversion is NA keeps only the actual words.
ipcc_no_numeric <- ipcc_stop %>%
  filter(is.na(as.numeric(word)))
```
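An alternative that avoids the coercion warnings from as.numeric() is a regular expression with stringr::str_detect(). A minimal sketch, using a toy tibble in place of ipcc_stop:

```r
library(dplyr)
library(stringr)

# Toy tokens standing in for ipcc_stop:
tokens <- tibble(word = c("warming", "1.5", "confidence", "2", "pathways"))

# Keep only tokens that are NOT purely numeric (digits and dots);
# unlike as.numeric(), str_detect() raises no coercion warnings:
tokens %>%
  filter(!str_detect(word, "^[0-9.]+$"))
```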
```r
# There are almost 2000 unique words
length(unique(ipcc_no_numeric$word))
```

```
[1] 1919
```

```r
# We probably don't want to include them all in a word cloud.
# Let's filter to only include the top 100 most frequent:
ipcc_top100 <- ipcc_no_numeric %>%
  count(word) %>%
  arrange(-n) %>%
  head(100)

ipcc_cloud <- ggplot(data = ipcc_top100, aes(label = word)) +
  geom_text_wordcloud() +
  theme_minimal()

ipcc_cloud
```

That’s underwhelming. Let’s customize it a bit:
```r
ggplot(data = ipcc_top100, aes(label = word, size = n)) +
  geom_text_wordcloud_area(aes(color = n), shape = "diamond") +
  scale_size_area(max_size = 12) +
  scale_color_gradientn(colors = c("darkgreen", "blue", "red")) +
  theme_minimal()
```

Cool! And you can facet wrap (for different reports, for example) and update other aesthetics. See more here: https://cran.r-project.org/web/packages/ggwordcloud/vignettes/ggwordcloud.html
First, check out the ‘sentiments’ lexicons. Julia Silge and David Robinson in their book say that:
“The three general-purpose lexicons are AFINN, bing, and nrc.
All three of these lexicons are based on unigrams, i.e., single
words. These lexicons contain many English words and the words are
assigned scores for positive/negative sentiment, and also possibly
emotions like joy, anger, sadness, and so forth. The AFINN
lexicon assigns words with a score that runs between -5 and 5, with
negative scores indicating negative sentiment and positive scores
indicating positive sentiment. The bing lexicon categorizes
words in a binary fashion into positive and negative categories. The
nrc lexicon categorizes words in a binary fashion
(“yes”/“no”) into categories of positive, negative, anger, anticipation,
disgust, fear, joy, sadness, surprise, and trust. All of this
information is tabulated in the sentiments dataset, and tidytext
provides a function get_sentiments() to get specific
sentiment lexicons without the columns that are not used in that
lexicon.”
Let’s explore the sentiment lexicons. “bing” is included in the
tidytext package; for the other lexicons (“afinn”, “nrc”,
“loughran”) you’ll be prompted to download them the first time you use
them.
```r
# Attach tidytext and textdata packages
# Uncomment the line below the first time you use the nrc lexicon
# get_sentiments(lexicon = "nrc")
# When you are prompted to install the lexicon, choose yes!

# Uncomment the line below the first time you use the afinn lexicon
# get_sentiments(lexicon = "afinn")
# When you are prompted to install the lexicon, choose yes!
```

afinn: words ranked from -5 (very negative) to +5 (very positive). http://corpustext.com/reference/sentiment_afinn.html
```r
get_sentiments(lexicon = "afinn")
```

```
# A tibble: 2,477 × 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# … with 2,467 more rows
```
```r
# Note: you may be prompted to download (choose yes)
# Let's look at the pretty positive words:
afinn_pos <- get_sentiments("afinn") %>%
  filter(value %in% c(3, 4, 5))

# Do not look at negative words in class.
afinn_pos
```

```
# A tibble: 222 × 2
   word         value
   <chr>        <dbl>
 1 admire           3
 2 admired          3
 3 admires          3
 4 admiring         3
 5 adorable         3
 6 adore            3
 7 adored           3
 8 adores           3
 9 affection        3
10 affectionate     3
# … with 212 more rows
```
bing: binary, “positive” or “negative” words. https://search.r-project.org/CRAN/refmans/textdata/html/lexicon_bing.html
```r
get_sentiments(lexicon = "bing")
```

```
# A tibble: 6,786 × 2
   word        sentiment
   <chr>       <chr>
 1 2-faces     negative
 2 abnormal    negative
 3 abolish     negative
 4 abominable  negative
 5 abominably  negative
 6 abominate   negative
 7 abomination negative
 8 abort       negative
 9 aborted     negative
10 aborts      negative
# … with 6,776 more rows
```
nrc: Includes bins for 8 emotions (anger, anticipation,
disgust, fear, joy, sadness, surprise, trust) and positive /
negative.
https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
```r
get_sentiments(lexicon = "nrc")
```

```
# A tibble: 13,872 × 2
   word        sentiment
   <chr>       <chr>
 1 abacus      trust
 2 abandon     fear
 3 abandon     negative
 4 abandon     sadness
 5 abandoned   anger
 6 abandoned   fear
 7 abandoned   negative
 8 abandoned   sadness
 9 abandonment anger
10 abandonment fear
# … with 13,862 more rows
```
Citations for all the lexicons:
Mohammad, Saif, and Peter Turney. “Crowdsourcing a Word-Emotion Association Lexicon.” Computational Intelligence 29 (3), 436-465, 2013.
Nielsen, Finn Årup. “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs.” Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages 718 in CEUR Workshop Proceedings, 93-98, May 2011. http://arxiv.org/abs/1103.2903
Hu, Minqing, and Bing Liu. “Mining and summarizing customer reviews.” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004), 2004.
Let’s do sentiment analysis on the IPCC text data using the
afinn and nrc lexicons.
First, bind the words in ipcc_stop to the afinn
lexicon:

```r
ipcc_afinn <- ipcc_stop %>%
  inner_join(get_sentiments("afinn"))
```

Let’s find some counts (by sentiment ranking):
```r
ipcc_afinn_hist <- ipcc_afinn %>%
  count(value)

# Plot them:
ggplot(data = ipcc_afinn_hist, aes(x = value, y = n)) +
  geom_col() +
  theme_bw()
```

Investigate some of the words in a bit more depth:
```r
# What are these '2' words?
ipcc_afinn2 <- ipcc_afinn %>%
  filter(value == 2)

# Check the unique 2-score words:
unique(ipcc_afinn2$word)
```

```
 [1] "strengthening" "support"       "inspired"      "integrity"     "sincere"
 [6] "appreciation"  "generous"      "supported"     "commitment"    "confidence"
[11] "determined"    "solid"         "supports"      "opportunities" "robust"
[16] "growth"        "benefits"      "ability"       "comprehensive" "assets"
[21] "importance"    "improved"      "effective"     "healthy"       "strong"
[26] "strengthened"  "carefully"     "improving"     "clean"         "responsible"
[31] "positive"      "strength"      "peace"         "justice"       "resolve"
[36] "asset"         "secure"        "ambitious"     "innovative"    "strengthen"
```
```r
# Count & plot them
ipcc_afinn2_n <- ipcc_afinn2 %>%
  count(word, sort = TRUE) %>%
  mutate(word = fct_reorder(factor(word), n))

ggplot(data = ipcc_afinn2_n, aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  theme_bw()
```

OK, so what’s the deal with “confidence”? And is it really “positive” in the emotion sense? Look back at the IPCC report and search for “confidence.” Is it typically associated with emotion, or something else?
We learn something important from this example: just matching words against a sentiment lexicon will not differentiate between different uses of a word (machine-learning approaches can start figuring this out from context, but we won’t do that here).
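Short of machine learning, one cheap way to judge how a word is actually used is to pull out the lines that contain it and read them. A minimal sketch, using toy lines in place of the ipcc_df text column:

```r
library(dplyr)
library(stringr)

# Toy lines standing in for ipcc_df$text_full:
lines <- tibble(text_full = c(
  "high confidence in the projected warming",
  "emissions pathways and system transitions",
  "medium confidence in regional estimates"
))

# Pull every line mentioning "confidence" to inspect its context:
conf_context <- lines %>%
  filter(str_detect(text_full, "confidence"))

conf_context$text_full
```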
Or we can summarize sentiment for the report:

```r
ipcc_summary <- ipcc_afinn %>%
  summarize(
    mean_score = mean(value),
    median_score = median(value)
  )
```

The mean and median indicate slightly positive overall sentiment based on the AFINN lexicon.
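If we had kept the page information, we could also track sentiment across the report rather than collapsing it to a single number. A sketch with toy scored tokens and a hypothetical page column (not present in ipcc_afinn as built above):

```r
library(dplyr)

# Toy AFINN-scored tokens with a hypothetical page column:
scored <- tibble(
  page  = c(1, 1, 2, 2, 2),
  value = c(2, -1, 3, 1, -2)
)

# Mean sentiment score per page -- useful for seeing how tone
# shifts across a document:
page_means <- scored %>%
  group_by(page) %>%
  summarize(mean_score = mean(value))

page_means
```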
We can use the nrc lexicon to start “binning” text by
the feelings it is typically associated with. As above, we’ll use
inner_join() to combine the IPCC non-stopword text with the
nrc lexicon:

```r
ipcc_nrc <- ipcc_stop %>%
  inner_join(get_sentiments("nrc"))
```

Wait, won’t that exclude some of the words in our text? YES! We
should check which are excluded using anti_join():

```r
ipcc_exclude <- ipcc_stop %>%
  anti_join(get_sentiments("nrc"))

# View(ipcc_exclude)

# Count to find the most excluded:
ipcc_exclude_n <- ipcc_exclude %>%
  count(word, sort = TRUE)

head(ipcc_exclude_n)
```

```
# A tibble: 6 × 2
  word         n
  <chr>    <int>
1 global     223
2 warming    188
3 1.5        169
4 pathways   111
5 chapter    103
6 2           95
```
Lesson: always check which words are EXCLUDED in sentiment analysis using a pre-built lexicon!
Now find some counts:

```r
ipcc_nrc_n <- ipcc_nrc %>%
  count(sentiment, sort = TRUE)

# And plot them:
ggplot(data = ipcc_nrc_n, aes(x = sentiment, y = n)) +
  geom_col() +
  theme_bw()
```

Or count by sentiment and word, then facet:
```r
ipcc_nrc_n5 <- ipcc_nrc %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(5) %>%
  ungroup()

ipcc_nrc_gg <- ggplot(data = ipcc_nrc_n5,
                      aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, ncol = 2, scales = "free") +
  coord_flip() +
  theme_minimal() +
  labs(x = "Word", y = "count")

# Show it
ipcc_nrc_gg

# Save it
ggsave(plot = ipcc_nrc_gg,
       filename = here("figures", "ipcc_nrc_sentiment.png"),
       height = 8,
       width = 5)
```

Wait, so “confidence” is showing up in the NRC lexicon as “fear”? Let’s check:
```r
conf <- get_sentiments(lexicon = "nrc") %>%
  filter(word == "confidence")

# Yep, check it out:
conf
```

```
# A tibble: 4 × 2
  word       sentiment
  <chr>      <chr>
1 confidence fear
2 confidence joy
3 confidence positive
4 confidence trust
```
There are serious limitations to sentiment analysis, depending on which existing lexicon you use. Think hard about your findings and whether a lexicon makes sense for your study. Otherwise, word counts and exploration alone can be useful!
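One pragmatic workaround is a custom stop list: drop domain-specific terms (like “confidence” in IPCC prose) whose lexicon sentiment is misleading, before counting. A sketch with toy sentiment-tagged tokens in place of ipcc_nrc; the stop list itself is hypothetical:

```r
library(dplyr)

# Hypothetical custom stop list of technical terms whose lexicon
# sentiment is misleading in this corpus:
custom_stops <- tibble(word = c("confidence"))

# Toy sentiment-tagged tokens standing in for ipcc_nrc:
tagged <- tibble(
  word      = c("confidence", "warming", "trust", "confidence"),
  sentiment = c("fear", "negative", "trust", "joy")
)

# Remove the ambiguous terms before counting sentiments:
tagged_clean <- tagged %>%
  anti_join(custom_stops, by = "word")

tagged_clean
```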
Choose one of the tasks below to practice your newly acquired sentiment analysis skills:
Taking this script as a point of departure, apply sentiment analysis to Game of Thrones. You will find a GOT.pdf in the data folder. What are the most common meaningful words, and what emotions do you expect will dominate this volume? Are there any terms as ambiguous as ‘confidence’ above?
Choose an English text of your own and subject it to sentiment analysis. For example, you can use the Arabian Nights from lesson 08-text-analysis.Rmd.
Choose a Danish text of your preference and analyze it. Beware that each language needs an appropriate sentiment dictionary. For Danish there is the ‘sentida’ package, available at https://github.com/Guscode/Sentida. Download instructions are in the README; ask your instructors for clarification.
This tutorial is inspired by Allison Horst’s Advanced Statistics and Data Analysis.